Towards Integrated Acoustic Models for Speech Synthesis

Authors

  • Prasanna Kumar Muthukumar
  • Alan W Black
  • Bhiksha Raj
  • Richard Stern
  • H. Timothy Bunnell
Abstract

All Statistical Parametric Speech Synthesizers consist of a linear pipeline of components. This view means that the synthesizer has a top-down structure in which data fed into the synthesizer goes to the front end, then to the prediction algorithm, then to waveform generation, and so on until the speech is finally constructed. Each component in this pipeline naively receives a stream of numbers from the preceding component and spits out a stream of numbers for the next one in line, with little to no knowledge of what happens in the larger scheme of the pipeline. In this thesis, I argue against this “Markovian” structure, and instead propose the idea of an Integrated structure. In an integrated structure, every component in the system influences, and is in turn influenced by, every other component in the system. This thesis describes four sets of experiments that move towards this idea. The first involves using lexical information to improve waveform generation algorithms. The second tries to increase the interaction between prediction algorithms and waveform generation. The third is an attempt to derive phonemes and phonetic information automatically from the speech rather than from the text. The last, and probably hardest, describes an idea for an evaluation metric that pays attention to multiple components of the synthesizer, rather than focusing on just a single one.

Similar resources

Prosodic models and speech synthesis: towards the common ground

Prosodic models have been extensively applied in speech synthesis. However, the necessity of synthesizing prosody has as yet not resulted in a generally agreed upon approach to prosodic modeling. This statement holds for the assignment of segmental durations as well as for generating F0 curves, the acoustic correlate of intonation contours. This paper concentrates on the use and usability of in...

First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention

In conventional neural networks (NN) based parametric text-to-speech (TTS) synthesis frameworks, text analysis and acoustic modeling are typically processed separately, leading to some limitations. On one hand, much significant human expertise is normally required in text analysis, which presents a laborious task for researchers; on the other hand, training of the NN-based acoustic models still ...

Towards Articulatory Speech Synthesis with a Dynamic 3D Finite Element Tongue Model

We describe work towards articulatory speech synthesis driven by realistic 3D tissue and bone models. The vocal tract shape is modeled using a fast 3D finite element method (FEM) of a muscle-activated human tongue in conjunction with fixed rigid models of jaw, hyoid and palate connected to a deformable mesh representing the airway. Actuation of the tissue model deforms the airway providing a ti...

Artisynth: an extensible, cross-platform 3d articulatory speech synthesizer

We describe our progress on the construction of a combined 3D face and vocal tract simulator for articulatory speech synthesis called ArtiSynth. The architecture provides six main modules: (1) a simulator engine and synthesis framework, (2) a two and three-dimensional model development component, (3) a numerics engine, (4) a graphical renderer, (5) an audio synthesis engine and (6) a graphical ...

Acoustic and Visual Analysis of Expressive Speech: A Case Study of French Acted Speech

Within the framework of developing an expressive audiovisual speech synthesis, an acoustic and visual analysis of expressive acted speech is proposed in this paper. Our purpose is to identify the main characteristics of audiovisual expressions that need to be integrated during synthesis to provide believable emotions to the virtual 3D talking head. We conducted a case study of a semi-profession...

Publication date: 2015